GEODESIC SINKHORN FOR FAST AND ACCURATE OPTIMAL TRANSPORT ON MANIFOLDS

Efficient computation of optimal transport distance between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state-of-the-art for such computations, but require On2 computations. In addition, Sinkhorn-based methods commonly use an Euclidean ground distance between datapoints. However, with the prevalence of manifold structured scientific data, it is often desirable to consider geodesic ground distance. Here, we tackle both issues by proposing Geodesic Sinkhorn—based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only O(nlog⁡n) computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high dimensional single cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.


INTRODUCTION
Optimal Transport (OT) distances or Wasserstein distances are computed by lifting ground distances between points to distances between measures.This distance is computed relative to a ground distance on the support of the distributions, making it more informative than distances based only on a pointwise comparison of the densities.However, to compute the Wasserstein, one needs to find the optimal transport plan from the source distribution to a target distribution; this is a linear programming problem requiring O(n 3 log n) for discrete distributions of size n [1].
An efficient modification of the optimal transport problem is to consider entropy-regularized transportation.This formulation is solved with the Sinkhorn algorithm [2] by iteratively rescaling a Gaussian kernel based on the distance matrix.It is equivalent to the Schrödinger Bridge problem, for which similar algorithms were developed [3]- [5].In the discrete case, it requires O(n 2 ) for distributions of size n, since it relies on matrix-vector products.Furthermore, this formulation allows for fast computation of the discrete barycenter with fixed support (the average distributions w.r.t. the Sinkhorn distance).An important drawback of the Sinkhorn algorithm is the necessity of storing and multiplying the pairwise distance matrix with a vector.
Additionally, the ground distance is commonly chosen as the Euclidean distance.The Euclidean distance is often suboptimal for high-dimensional datasets over larger distances according to the manifold hypothesis, which says observations lie near a low dimensional (curved) manifold in high dimensional space [6].For higher dimensional datasets assumed to be sampled from a lower dimensional manifold, using a distance closer to the manifold for OT has shown promising results [7]- [10].
In this work, we present Geodesic Sinkhorn 1 ; a Sinkhornbased method for fast optimal transport with a heat-geodesic ground distance.Our method is based on the geometry of the dataset constructed from a common graph and uses the heat kernel on the graph to defined a heat-geodesic distance.Key to this approach, we will never need to construct or operate on an n × n distance matrix, and we will only use the sparse Laplacian matrix and sparse matrix-vector products.For sparse graphs, this can be used for O(n log n) computation of the Sinkhorn distance with a manifold ground distance, improving on the O(n 2 ) implementations based on dense matrices.
Increasing the state-of-the art efficiency in Sinkhorn computation opens us up to being able to perform complex operations on large groups of datasets.In particular, we consider interpolating between datasets and show that using our heat-geodesic distance improves the interpolation accuracy compared to OT with Euclidean distance.The barycenter corresponds to the average distribution of a set of distributions.Our method allows for finer-grained barycenters on a data manifold, which motivates us to define a novel notion of dissimilarity between families of distributions called barycentric distance.
We apply the barycentric distance to single cell data from patient-derived cancer organoids (PDOs) to assess the effect of treatments (such as drugs and chemotherapy).Here we have one set of PDOs from control conditions, and another set that are treated.The treatment effect is thus the distance between these barycenters.In addition, we use Geodesic Sinkhorn's barycenter to compare the effect from one family of distributions to another.
Our main contributions include: (1) A new method for computing optimal transport distances on a manifold called Geodesic Sinkhorn, which is highly efficient in time and memory.(2) Defining the barycentric distance; a novel distance between families of distributions, and showing its utility in deriving treatment effect from control and treated patient samples.

RELATED WORK
Geodesic Sinkhorn is related to prior work linking the entropyregularized optimal transport problem triangular mesh with the heat operator [8], [11], but using different graph filtering techniques.These approaches approximate the application of the heat kernel to a vector by discretizing the heat equation and solving systems of linear equations.This technique was used in different contexts, either with the cotangent Laplacian [8] or to learn a ground metric [12].Solving these systems for each Sinkhorn iteration can be done efficiently with a sparse Cholesky decomposition.However, this method's efficiency depends mainly on the efficiency of the Cholesky decomposition which can be slow depending on the sparsity pattern is O(n 3 ) for an n × n matrix, and necessitates solving 2K systems of linear equations per Sinkhorn iteration, where K is the number of sub-steps in the backward Euler discretization.

PRELIMINARIES
In this section, we start by reviewing the basics of OT and the Wasserstein distance, as well the Sinkhorn distance.Then we review two notions fundamental to our method; the heat equation on a graph and the Chebyshev approximation of the heat kernel.

Wasserstein Distance
In the following, we assume that all distributions admit a density or a probability mass function, and we use the same notation for both.Let µ, ν be two probability distributions on a measurable space X ⊆ R d with metric d(•, •), let Π(µ, ν) be the set of joint probability distributions π on the space X × X where, for any measurable subset ω ⊂ X , π(ω × X ) = µ(ω) and π(X × ω) = ν(ω).The p-Wasserstein distance is defined as: In the following, we consider p = 2.An exact algorithm based on linear programming can solve this problem in O(n 3 log n) time for discrete distributions of size n.

Sinkhorn Distances
The Kullback-Leibler (KL) divergence between π and some strictly positive K on X × X is defined as The Sinkhorn distance2 is a relaxation of equation 1 where the infimum is over all coupling in {π ∈ Introduced in [13], the optimization of this distance can be solved by considering the entropyregularized transport where we define the entropy of a coupling π as H(π) := − ln π(x, y)dπ(x, y), and λ > 0. This formulation converges to the Wasserstein distance as λ → 0, and can be solved with the Sinkhorn algorithm with complexity of the order O(n 2 /ϵ) for discrete distributions of size n [13].In the discrete case, the transport matrix π admits the form diag(v)K λ diag(w), where v, w are vectors of size n.The Sinkhorn algorithm iteratively updates the vectors as (v, w) ← (µ./K λ w, ν./K ′ λ v), where K λ := e −d(x,y) 2 /λ .Following [8], using the kernel K λ gives an alternative interpretation of the Sinkhorn distance as The problem in equation 3 is strictly convex and continuous yielding a unique minimizer.In the discrete case, this leads to an algorithm for the entropy-regularized Wasserstein distance based on the Sinkhorn algorithm enforcing the marginal constraints on the kernel K λ while minimizing the distance as quantified by D KL .
The underlying metric d(•, •) is generally unknown, thus the kernel K λ cannot be evaluated.The authors of [8] proposed to approximate K λ with the heat kernel H t (x, y) on X .
According to Varadhan's formula [14], the geodesic distance on a manifold can be recovered from the heat transfer at small timescales as Hence, motivating the use of the heat-geodesic distance d 2 H (x, y) := −4t ln H t (x, y), with associated kernel K λ (x, y) = H λ/4 (x, y).Interestingly, Sinkhorn-based methods admit an efficient algorithm to solve the barycenter problem which we present next.

Interpolation with discrete support
By constraining the support to a set X (or a graph), we can efficiently interpolate between more than two distributions.The barycenter problem [1], [8], [15] generalizes the notion of average between points to an average between distributions.For a set of m distributions {µ 1 , . . ., µ m } supported on X , the objective is to find a distribution minimizing the average distance where P(X ) denotes the space of probability distributions supported on X , and {α 1 , . . ., α m } are non-negative weights.
Finding the barycenter is a challenging optimization problem, however the barycenter for Sinkhorn-based methods admits an efficient computation.It involves updating m vectors v i , w i , which define a transport plan from µ i to the barycenter µ * .The support of the barycenter is constrained to X , for most Sinkhorn-based methods the size of X needs to be small for computational reason.Our method does not suffer from such a limitation.Hence, we can consider barycenter with greater expressivity, and interpolate between large sets of distributions.

Heat Diffusion on a Graph
Consider an undirected graph G = (V, E) with a set V of n vertices and a set of edges E, and its weighted adjacency matrix A with non-negative edge weights, and the diagonal degree matrix D, where D ii := k A ik .We define the combinatorial Laplacian as L := D − A, for any function The combinatorial Laplacian is a symmetric positive semi-definite matrix, and has an eigendecomposition L = ΨΛΨ T with orthonormal eigenvectors Ψ and diagonal eigenvalue matrix The combinatorial Laplacian is a natural extension of the negative of the Laplacian operator to a graph.For a signal f 0 ∈ R n on G, the diffusion of f 0 on the graph evolves according to the heat equation The heat kernel solves this ODE, it is defined by the matrix exponential H t := e −tL .By orthogonality of the eigenvectors of L, we can write H t = Ψe −tΛ Ψ T and f (t) = H t f 0 .Computing H t by eigendecomposition would require O(n 3 ) operations.Recall that, for the Sinkhorn algorithm, we are only concerned with the application of the heat operator H t on a signal f ∈ R n .For larger diffusion time, the heat kernel converges to its eigenvector associated to the lowest eigenvalues of the Laplacian, hence, intuitively, the heat kernel corresponds to a low-pass filter.In Geodesic Sinkhorn, we use Chebyshev polynomials [16], [17] to approximate the application of the heat operator to a signal.For a short timescale t, using the heat kernel accounts for using the geodesic distance as ground distance in the entropy-regularized OT formulation equation 3.

Chebyshev Polynomials
Polynomial sequences are often used to approximate functions or operator.With Chebyshev polynomials, we can approximate the application of the matrix exponential H t = e −tL to a signal f on the graph.An attractive property of Chebyshev polynomials is that the approximation error decays exponentially with the maximum degree K.They are defined by the recursive relation {T k } k∈N with T 0 (y) = 0, T 1 (y) = y and where the K + 1 scalar coefficient {b k } depend on time and can be evaluated with the Bessel function.The approximation of H t is based on the first K term of the series which we note p K (L, t).It results in K matrix-vector products which can be efficient since, in general, L is a sparse matrix.On a m-nearest neighbor graph, this can be O(Kmn/λ), where λ is a regularization parameter.Chebyshev polynomials admits interesting theoretical properties and are known to converge faster than other polynomials [17], [18].The choice of the parameter K is related to the number of neighbors or the connectivity of the graph.For small diffusion time, hence only diffusing in a local neighborhood, the approximation is accurate even with a small K.As the diffusion time increases, K has to increase in order to consider a larger neighborhood around a node.For OT, we consider small diffusion time, and we found that our results were stable for all K greater than 10.

GEODESIC SINKHORN DISTANCES
We define the Geodesic Sinkhorn distance between any signals or distributions on a graph G by the entropy-regularized OT with the heat kernel H t on the graph.This construction is Algorithm 1 Geodesic Sinkhorn Input: n × n Laplacian L, distributions (signals) µ, ν on G, maximum Chebyshev degree K, regularization λ, a vertices weights. Output: also valid between any point cloud datasets.In that case, for m datasets {X 1 , . . ., X m } sampled from a set of distributions {µ 1 , . . ., µ m }, we construct a common graph using an affinity kernel on the m datasets and compare two samples by taking the distance between two indicator functions on the graph.We approximate the heat kernel H t with Chebyshev polynomials p K (L, l) of order K.In Algorithm 1, we present the main steps to evaluate the Geodesic Sinkhorn.It is based upon Sinkhorn iterations [2], [13], where ⊘ and ⊙ denote respectively the elementwise division and multiplication.Note that, as opposed to the usual Sinkhorn algorithm, we never have to store a dense n × n distance matrix, but only the usually sparse graph Laplacian.In the following proposition, we find the ground distance implicitly used in the optimal transport defined by Geodesic Sinkhorn.We use ≃ for the equivalence relation between distances.

Proposition 4.2.
There exists a maximum Chebyshev polynomial degree K such that the ground distance in Geodesic Sinkhorn is equivalent to the one based In particular, the Wasserstein distances with these ground distances are equivalent.
Proof.Because the approximation error decreases exponentially in K [17], we have that for any ϵ > 0 sufficiently small there exist Choose K such that this is true for all vertices K := max{K 1 . . ., K n }.We define and we have the equivalence between the distances since and since the logarithm is a monotonic function.
In [8], [12], using the Euler implicit discretization results in a ground cost of the form −ϵ ln((Id − ϵ 4K L) −K ), where Id is the identity matrix, and can be seen as another approximation for the matrix exponential.
The efficiency of Geodesic Sinkhorn improves the notion of barycenter as it is possible to consider much larger graph G, thus a finer grained support of the barycenter.This leads us to define a novel distance between families of distributions.Definition 4.3.For two finite families of distributions T and C supported on G, we define the barycentric distance between the families T , C as where µ * T , µ * C are respectively the barycenters of T and C. The previous definition is valid for any distances between distributions or barycenters.However, OT barycenters are known to be more informative than others [15].We will further explore this comparison in our experiments.We use it to distinguish between two groups in a medical setting where a set of patients received a treatment (defining the family T ), and another set acts as a control family C. Following this idea, we define a notion of effect between two families.C and Y t ∼ µ * T .Note that we compute the average on the family of distributions instead of the average on their support, hence we evaluate their expectations in a closed form.This definition also extends to a conditional equivalent where families of distribution can be subdivided with discrete covariate variables.When the barycenters are computed with the total variation, this definition is equivalent to the naive Average Treatment Effect(ATE) [19]; i.e. difference of empirical means.

RESULTS
We demonstrate the accuracy and efficiency of the Geodesic Sinkhorn distance on two tasks: (1) Nearest-Wassersteinneighbor calculation on simulated data with manifold structure similar to the setup of [10]; (2) A newly defined Barycentric distance between families of distributions computed to quantify the effect of a treatment on patient-derived organoids.In Appendix A.1, we present additional results on time series interpolation.

Nearest-Wasserstein-neighbor distributions
In this experiment, we compare our method with Sinkhorn [13], and LR Sinkhorn [20], both algorithms with Euclidean and squared Euclidean ground distance, with DiffusionEMD [21], and Sinkorn with Euler approximation of the heat filter.We created 15 Gaussian distributions sampled randomly on a swiss roll dataset, and sampled 10k observations from each distribution.We rotated the observations in 10 dimensions.We consider a k-nearest neighbors task on these distributions.We evaluate the methods with the ground truth, since we know the exact geodesic distance on the manifold.In Tab. 1, we report the average and standard deviation over 10 seeds of the Spearman and Pearson correlations to the ground truth, and the runtime in seconds with and without the computation of the graph.Our method is the most accurate while being much faster than other Sinkhorn-based methods.

Barycentric distance
We test if we can identify a linear treatment effect with the Expected Barycenter Effect (EBE).In this experiment, we create a control family of distributions C of ten standard Gaussian distributions.The treatment group consists of nine Gaussian distributions N (5, 1), and one outlier centered at different means.For each distribution, we sample 500 observations, and reproduce the experiment over ten seeds.In Tab. 2, we report the EBE and its standard deviation with the Geodesic Sinkhorn, the Total Variation (TV) distance, and Sinkhorn.
Since the TV only compares the mean, it is sensitive to the outlier, whereas our method can identify the true treatment effect.

Single-cell signaling data
We use single-cell signaling data produced by mass cytometry (MC) for a screening study to compare the treatment effect of different chemotherapies on 10 colorectal cancer (CRC) patient-derived organoids (PDOs) [22].These PDOs can be grouped into chemoresistant PDOs, that show little-to-no effect when treated with chemotherapies; and chemosensitive PDOs, that present strong shifts in their phenotypes upon treatment.The observations include single-cell data information on the cell cycle and signaling changes upon treatment of PDOs with different CRC treatments at a range of concentrations.In Fig. 1, we present the barycentric distances matrices between treatments a) and between four concentrations of treatment SN-38 (S) b).In both cases, the control groups corresponds to AH and DMSO, the two rightmost columns.We compare the distance matrices between Sinkhorn (left) and our method (right).Our method provide a finer distinction between treatments (Fig. 1 top) and concentrations (Fig. 1 bottom), especially for the chemosensitive group.As observed in [22], chemosensitive PDOs show little-to-no response to lower concentrations of SN-38 (S1), but their phenotype shifts very strongly upon treatment with higher concentrations (S2, S3, and S4) (Fig. 1 b).When comparing combinations of different treatments (Fig. 1 a), Geodesic Sinkhorn better resolves the difference between SN-38 (S) alone and in combination with Cetuximab (C), showing that S is the main agent creating the treatment effect and the combination with C does not resolve in a synergistic effect [22].Note that we only consider the relative magnitude of the distances, since the two algorithm use different ground distances.

CONCLUSION
In this work, we considered the use of OT for graphs and large datasets in high dimensions potentially sampled from a lower dimensional manifold.We proposed Geodesic Sinkhorn, a fast implementation of the Sinkhorn algorithm using the graph Laplacian and Chebyshev polynomials.Our method is well adapted for large and high dimensional datasets as it is defined with a geodesic ground distance, which takes into account the underlying geometry of the data, and requires less computation time and less memory.On a synthetic dataset, we showed that Geodesic Sinkhorn is much faster than other Sinkhorn-based methods while being more accurate.With the Wasserstein barycenter, we defined the barycentric distance to compare entire families of distributions, and the expected barycenter effect, then applied both methods to a large PDO drug screen dataset.

Definition 4 . 4 .
For two family of distributions T and C supported on G, define the Expected Barycenter Effect of T as τ (T ) := E µ * T (Y t ) − E µ * C (Y c ), where µ * T , µ * C are respectively the barycenters of T and C, and the features Y c ∼ µ *

Fig. 1 .
Fig. 1. a) Barycentric distances matrices for the Sinkhorn algorithm (left) and our method Geodesic Sinkhorn (right).b) Barycentric distances matrices between doses of treatment SN-38, for four concentrations S1 S2 S3 S4.Control groups correspond to AH and DSMO.Geodesic Sinkhorn provides a clearer distinction between treatments, and doses.

Table 1 .
KNN task for 15 distributions, best score highlighted is bold.Geodesic Sinkhorn is the most accurate, while being faster than other Sinkhorn-based methods.